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Food and beverage authentication is the process by which foods 
or beverages are verified as complying with its label description, for 
example, verifying if the denomination of origin of an olive oil bot- 
tle is correct or if the variety of a certain bottle of wine matches 
its label description. The common way to deal with an authentica- 
tion process is to measure a number of attributes on samples of food 
and then use these as input for a classification problem. Our moti- 
vation stems from data consisting of measurements of nine chemi- 
cal compounds denominated Anthocyanins, obtained from samples 
of Chilean red wines of grape varieties Cabernet Sauvignon, Merlot 
and Carmenere. We consider a model-based approach to authentica- 
tion through a semiparametric multivariate hierarchical linear mixed 
model for the mean responses, and covariance matrices that are spe- 
cific to the classification categories. Specifically, we propose a model 
of the ANOVA-DDP type, which takes advantage of the fact that the 
available covariates are discrete in nature. The results suggest that 
the model performs well compared to other parametric alternatives. 
This is also corroborated by application to simulated data. 

1. Introduction. Food and beverage authentication is the process in 
which foods or beverages are verified as complying with its label descrip- 
tion [Winterhalter (2007)]. From the viewpoint of consumers' acquisition, 
the mislabeling of foods represents commercial fraud [Mafra et al. (2008)]. 
On the other hand, producers and sellers could have problems if their prod- 
ucts are mislabeled. Food authentication is important for foods and bever- 
ages of high commercial value, like honey, wines or olive oil, because their 
prices depend of their quality, variety or origin. It is then important to un- 
cover unscrupulous sellers who decide to increase their profit by adulterating 
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these products with similar but lower quality substances. Misleading label- 
ing might also have negative health implications, especially when the food 
has undeclared allergenic compounds. 

Because of the growing demand from consumers of clarity and certainty 
in food origins and contents, the importance of food authentication has 
substantially increased in recent years. Many analytical tools and methods 
used for authenticity have been consequently developed. In particular, there 
is a very active area of research on the determination of chemical markers 
for classification and/or authentication of wines. Anthocyanin profiles are 
known to be specially useful for the purpose of wine variety authentication. 
See, for example, Eder, Wendelin and Barna (1994), Berente et al. (2000), 
Holbach, Marx and Ackerman (2001), Revilla et al. (2001), Otteneder, Marx 
and Zimmer (2004) and von Baer et al. (2007). 

Data analysis methods for authentication purposes have been developed 
mainly outside the statistics fields, and most of them are exploratory tech- 
niques designed to deal with multivariate data sets. Probabilistic model- 
ing for discrimination and authentication purposes was proposed by Brown, 
Fearn and Haque (1999), who used Bayesian methods to discriminate 39 
microbiological taxa using their reflectance spectra. More recently, Dean, 
Murphy and Downey (2006) used a Gaussian mixture model with labeled 
and unlabeled samples, with application to the authentication of meat sam- 
ples from five species, and the geographic origin of olive oils. Toher, Downey 
and Brendan (2007) compared model-based classification methods such as 
Gaussian mixtures with partial least squares discriminant analysis, consid- 
ering samples of pure and adulterated honey. 

We propose a model-based procedure to solve the authentication problem 
of foods and beverages. The motivation comes from a data set consisting of 
measurements of nine chemical compounds denominated Anthocyanins, ob- 
tained from samples of Chilean red wines of grape varieties Cabernet Sauvi- 
gnon, Merlot and Carmenere. We propose a semiparametric Bayesian model 
that allows us to define a flexible distribution G for the joint measurements. 
The model has the advantage of not having to assume any parametric form, 
which may be particularly difficult to check in multivariate cases. Increased 
flexibility is added by allowing G to be formulated under the formalism of 
dependent random probability measures as in De lorio et al. (2004). An 
interesting aspect of the proposed approach is the need to extend previous 
univariate semiparametric models as in De la Cruz-Mesia, Quintana and 
Miiller (2007) to the multivariate case. 

The rest of the paper is organized as follows. We first present the wine 
data set and the related authentication problem in Section 2. In Section 3 we 
give a brief theoretical background about Bayesian semiparametric models 
and dependent Dirichlet processes, and discuss our approach to the authen- 
tication problem. In Section 4 we present the model, which is an extension 
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of the univariate semiparametric Bayesian linear mixed model [Dey, Miiller 
and Sinha (1998)] to the multivariate case. In Section 5 we illustrate the 
performance of the proposed model in a simulated data set. In Section 6 we 
apply the model to authenticate red wines samples based on their antho- 
cyanin profile. The paper concludes in Section 7 with a discussion and final 
remarks. 

2. The motivating data set. We consider a data set consisting of mea- 
surements of concentrations of nine anthocyanins on samples of Chilean 
red wines. Anthocyanins are a group of chemical compounds present in 
red wine, which confer to this beverage its characteristic red color and 
are transferred from the grape skins to wine during the winemaking pro- 
cess. The data set includes the grape variety for each sample as declared 
by the producer, the year of harvest and the geographical origin or valley. 
The grape varieties in the data set are Cabernet Sauvignon (228 samples), 
Carmenere (95 samples) and Merlot (76 samples). These samples form a data 
set with mixed wine types, which represents the most abundant red grape 
varieties cultivated in Chile across different valleys. The data are unbal- 
anced, and there are combinations of variety and location for which no 
observations are available. All wine samples came directly from wineries 
located in the valleys of Aconcagua, Maipo, Rapel, Curico, Maule, Itata 
and Bfo-Bio in Chile. They correspond to the vintages 2001, 2002, 2003 
and 2004. Anthocyanin determination was made by reverse phase High Per- 
formance Liquid Chromatography (HPLC), a Chromatographic technique 
that can separate a mixture of compounds and is used in analytical chem- 
istry to identify, quantify and purify the individual components of complex 
mixtures, like wines and others beverages or foods. The analytical chem- 
istry procedure was based on the method described by Holbach, Marx and 
Ackermann (1997), Otteneder et al. (2002) and by the International Orga- 
nization of Vine and Wine (OIV) [OIV (2003)], with some minor modifica- 
tions. More details about anthocyanin determination for this data set can be 
found in von Baer et al. (2005, 2007). A main concern for these data is the 
authentication of grape variety using the log-concentrations of the follow- 
ing anthocyanins: delphinidin-3-glucoside (DP), cyanidin-3-glucoside (CY), 
petunidin-3-glucoside (PT), peonidin-3-glucoside (PE), malvidin-3-glucoside 
(MV), peonidin-3-acetylglucoside (PEAC), malvidin-3-acetylglucoside 
(MVAC), peonidin-3-coumaroylglucoside (PECU) and malvidin-3-couma- 
roylglucoside (MVCU). To do so, we propose in Section 4 a multivariate 
linear mixed model that attempts to characterize the variability in antho- 
cyanin log-concentrations in terms of variety and valley of origin. We also 
point out that we ignore vintage year in our development. The pragmati- 
cal reason for this is that by doing so we may easily incorporate data from 
new years as they become available, without the need to modify the model. 
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In support of this choice, we refer to Gutierrez et al. (2011) who used the 
year of harvest as a continuous predictor when proposing a Bayesian para- 
metric model for the same data. The idea was to overcome this very same 
limitation. Yet, the effect of vintage year was negligible in that context. 

3. Some background material. Semiparametric models have both para- 
metric and nonparametric parts, the distinction between these being that 
the parameters belong to a finite and infinite-dimensional space, respec- 
tively. Semi- and nonparametric Bayesian models are used mainly to avoid 
critical dependence on parametric assumptions. An important application 
of such modeling strategy is to random effects distributions in hierarchical 
models, where often little is known about the specific form of such distri- 
butions [Miiller and Quintana (2004)]. To handle the nonparametric part of 
the model, we need to define a random measure on the space of distribu- 
tion functions. The most popular such choice is the Dirichlet process (DP) 
[Ferguson (1973)]. 

In a food authentication context scenario, we need to build a model that 
adequately accounts for all the problem-specific features. In the specific case 
of our motivating data set, it is reasonable to think of wines coming from 
the same valley as being correlated, because soil and weather conditions 
are similar within a given valley. The usual (and simplest) way to induce 
a correlation structure is by incorporating random effects or sample spe- 
cific parameters in a model. Let ai denote the random effects and let Zi be 
a categorical covariate with k levels (e.g., k different regions of origin). We 
could assume a single nonparametric prior on Oj for all samples, without 
reference to the levels of Zj. Alternatively, we could consider differences by 
putting k independent priors on ai. These two extreme modeling strategies 
imply that Gz^ = ■ ■ ■ = Gz^. for the former and Gzj , ■ ■ ■ , Gz^. to be mutu- 
ally independent for the latter. MacEachern (1999) proposes a modeling 
strategy, the Dependent Dirichlet Processes (DDP), that allows the set of 
random effects distributions to be similar but not identical to each other. 
MacEachern (1999) defines a nonparametric probability model for Gz in such 
a way that marginally, for each z = Zj {j = 1, . . . ,k), the random measure Gz 
follows a DP. In this context, the DP representation proposed by Sethura- 
man (1994) is quite useful. Sethuraman's representation establishes that any 
G ~ DP(M, Go) can be represented as an infinite mixture of point masses: 

oo 
h=l 

(1) 

Wh = Uh'[l{l- Uj) with Uh ''^ ■ Beta(l, M). 
j<h 

The key idea behind the DDP is to introduce dependence across the Gz 
measures by assuming the distributions of the point masses to be depen- 



MULTIVARIATE BAYESIAN SEMIPARAMETRIC MODELS 



5 



dent across different levels of z (i.e., \izh)^ but still independent across h. 
If the weights are assumed to be the same across z, the dependent prob- 
ability measure can be represented as Gz{-) = Ylh'=i''^hSf^^h- '^be last idea 
was used by De lorio et al. (2004) in the construction of an ANOVA-DDP 
type model. A similar approach was used in spatial modeling by Gelfand, 
Kottas and MacEachern (2005), who proposed a Gaussian process for the 
atoms, Caron et al. (2006) in times series, De la Cruz-Mesia, Quintana and 
Miiller (2007) in classification, De lorio et al. (2009) in survival analysis and, 
recently, by Jara et al. (2010) who proposed a Poisson-Dirichlet process for 
the analysis of a data set coming from a dental longitudinal study. Griffin 
and Steel (2006) point out that letting only the atoms depend on covari- 
ate values may lead to certain problems when points in the domain are far 
from the observed data. They propose an approach that avoids this by lo- 
cally updating the process and inducing dependence in the weights through 
distance-based similarities in the ordering of atoms, through viewing the 
atoms as marks in a point process. Other works where covariate dependence 
is introduced in the weights are Dunson, Pillai and Park (2007) and Dunson 
and Park (2008). Miiller, Erkanli and West (1996) considered a completely 
different approach for inducing dependence in G. They used a DP mixture 
of normals for the joint distribution of y and z, and then focused on the 
implied conditional density of y given z for estimating the mean regression 
function. A recent reference about nonparametric Bayesian statistics, DDP 
models and their applications can be found in Hjort et al. (2010). 

The almost sure discreteness of the Dirichlet process makes it inappro- 
priate as a model for a continuous quantity y. A standard procedure for 
overcoming this difficulty is to introduce an additional convolution so that 

(2) H{y) = jf{y\ 9) dG{e) with G ~ DP(A/, Go). 

Such models are known as DP mixtures (DPM) [Antoniak (1974)]. The 
mixture model (2) can be equivalently written as a hierarchical model by 
introducing latent variables 6i and breaking the mixture as 

(3) y^\ei^f{yi\ei), Oi^G and G~DP(M,Go). 

For the majority of food authentication problems the responses are con- 
tinuous multivariate and covariates are discrete. This is the case for the 
data described in Section 2. Thus, we will adopt the popular semiparamet- 
ric modeling strategy that consists of introducing dependence in the random 
effects distribution and then adding a convolution with a continuous kernel. 
The ANOVA-DDP approach of De lorio et al. (2004) is a natural way to 
build the desired dependence into the model, as will be discussed below in 
Section 4. We remark here that a model that defines dependence in terms 
of distances would not be appropriate for an authentication problem with 
categorical covariates, as is our case. 



6 



L. GUTIERREZ AND F. A. QUINTANA 



4. The model. We first note that due to the multivariate nature of many 
authentication problems (which is also the case of the wine data) , it would 
not be appropriate to treat the individual responses in an univariate way. 

We assume that the ith. response vector is related to the covariates in 
a linear way. Furthermore, we assume that there are fixed and random effects 
in the model. The model for the ith unit in the uth group or class in our 
classification context is thus assumed to be given by 

Qiu ~ Hz{9iu), 

(4) 

Hz{e)= [ N{e\za,T)dG{a), 



G~DP(M, Go), 

where yiu is a vector of responses in K^, B is a pxq matrix of fixed effects, 
is a vector of covariates in R'^ , 9iu is a p x 1 vector of unit-specific random 
effects, Ziu is a p X pk design matrix for random effects and a is a pA; x 1 
vector of latent variables that define the random effects. For classification 
purposes, the subscript u denotes the group or class. Model (4) implies that 
Hz{9) = Ylh=i WhN{6 I za/i, r) is an infinite mixture of normal distributions. 
As usual in mixture models, posterior simulation proceeds by breaking the 
mixture in (4) by introducing latent variables ai: 

(5) 9iu = Ziuai + i]u ai^G, G~DP(M,Go) and ?7i ~ iVp(0, r). 

By simplicity, we choose a multivariate normal model for the base mea- 
sure Go = Npk{0,R) and as usual in this context, we assume prior indepen- 
dence for all remaining parameters. The prior distribution for matrix B = 
[Pii ^2, ■ ■ ■ , is assumed to be independent by columns, that is, /5i, /32, ■ • ■ , /5g 
are mutually independent with distribution given by 

(6) /3i,...,/3,-Afp(/3oj,A), j = l,...,q. 

The prior distributions for the variance-covariance matrices u = 1, . . . , m, 
and r are given by 

(7) Si,...,S^~/W^p(zyo,Qo), r~/VFp(7o,$o). 

We complete the Bayesian formulation of model (4) by specifying the prior 
for hyperparameters R, /3oi, . . . ,(3og, A and M as 

(8) R^IWpkiro,Ro), /3oi,...,/3og~A^p(ao,To), 

(9) A~/Wp(Lo,to), M~Ga(ai,a2). 

The random distribution Hz{0) in model (4) is dependent of the level 
of covariate z. As such, this follows the model proposed by De lorio et al. 
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(2004), but our focus is on the application to multivariate data. Matrix R 
in the model allows for correlation between all components of vector Oj, 
which implies correlation between different components of the response vec- 
tor and between different levels of z. The full conditional posterior dis- 
tributions and details of the posterior simulation scheme are given in the 
Appendix. 

Consider now a general classification approach, and denote the train- 
ing data set by ?/" = {yi, . . . ,yn,xi, . . . ,Xn, zi, . . . , Zn, gi, . . . ,gn)- Here, yiu = 
{Vi'- di = u), ti = 1, . . . ,m, is the response vector for the uth. group, xiu = 
{xi :gi = u) is the vector of covariates for fixed effects, zi^ = {zi :gi = u) is 
a vector of covariates for random effects and gi represents the known group 
label for the ith unit. Consider a new unit for which the response yn+i and 
covariate vectors Xn+i and Zn+i are known, but its label gn+i is unknown. 
We want to assign a label u to the new unit, where u £ {1, . . . ,m}. Conse- 
quently, it is necessary to estimate the classification probability P{gn+i = u \ 
yn+i,y"')- Following De la Cruz-Mesia and Quintana (2007) and Gutierrez 
et al. (2011), we use 



(10) P(5„+i = u I y„+i,y") 



1 T^uP{yn+l I &u^) 



In (10), 7r„ = P{gi = u) may be taken as the empirical group proportions. 
We propose classifying an existing unit, i, and a future one, n + 1, using the 
zero-one law considered in Hastie, Tibshirani and Friedman (2001): 

(ji = argmaxP((7j = u \ y"") and 

(U) 

gn+1 = argmaxP(5„+i = u j y",y„,+i), 

u 

that is, assigning the label as the category that maximizes the classification 
probability (10). Note that (10) and (11) are part of a generic classifica- 
tion approach which is based on predictive distributions. In particular, (10) 

relates to the model described in (4) through the quantities, which 
represent samples from the posterior distribution for model (4), given the 
training data set y". For the wine data analysis later in Section 6, we will let 
the fixed effects be varieties and random effects be the different regions of 
origin. For classification purposes, we are assuming that a new wine sample 
is available, for which we know its anthocyanin concentrations and valley, 
and the aim is to predict its variety. Note that the variety here corresponds 
to the label u and there are not other covariates for fixed effects, so the rows 
on the matrix of a fixed effects are given by Xi = (1,0,0) if (^j = 1 (Caber- 
net Sauvignon), Xi = (0,1,0) if i^j = 2 (Merlot) and Xj = (0,0,1) ii gi = 3 
(Carmenere). 
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Fig. 1. Simulated data set. 



5. Classification performance of the proposed model. To evaluate the 
classification performance of the proposed model, we simulated a data set 
considering m = 2, n = 100, p = 2, q = 2, k = 2. We simulated from a mix- 
ture of p-variate normal distributions, ^^^-^ a;jA^(/ij, S), where uji,..., ujs 
are given by (0.25, 0.12, 0.13, 0.1, 0.1, 0.05, 0.12, 0.13), respectively, m = 
(1.1,2.3)*, fi2 = (0.1,-2)*, ^3 = (1.3,5)*, /i4 = (-3,3.4)*, ^5 = (-0.1,7)*, 
^6 = (1-8,5)*, //7 = (-4,l)*, /i8 = (l,-2)* and S is given by cJn = 0.932, 
(T12 = 0.11 and (T22 = 1.632. Figure 1 shows the simulated data set. Here, 
m = 2 means that we have to classify between two categories with k = 2 
levels for the covariate z. The hyperparameter values were taken as = 
(0, 0)*, To = 100/2, Qo = h, Lo = I2, = 4, ro = 4, to = 4, i?o = Ipk, 7o = 4, 

= 0.001/p and ai = 02 = 1. Table 1 shows the classification results of the 
proposed Bayesian semiparametric model (BSP), comparing with linear dis- 



Table 1 

Classification performance for the simulated data set. Values within parenthesis were 
obtained using leave-one-out cross-validation technique 







BSP 




BP 




LDA 


Category 


1 


2 


1 


2 


1 


2 


1 


47 (43) 


3(7) 


47 (35) 


3 (15) 


42 (42) 


8(8) 


2 


3(9) 


47 (41) 


9(9) 


41 (41) 


17 (19) 


33 (31) 
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Fig. 2. ROC curves for classification of the simulated data set under Bayesian semipara- 
metric model (BSP), Bayesian parametric (BP) and linear discriminant analysis (LDA). 

criminant analysis (LDA) , which is the usual technique used in the literature 
for this type of problem, and a parametric (BP) version of model (4), defined 
as 

iViu I Xiu,Ziu) ~ Np{Bxiu + Oiu,Y.u), i = l,...,n,u=l,...,m, 
(12) Oiu = ZiuO + 7]i, r/j ~iVp(0,r), 

a^Npk{0,R). 

Using the proposed BSP model, we obtained a classification error of 7.0% 
in the training set and 16.0% using leave-one-out cross-validation (LOOCV). 
In contrast, the BP model resulted in a classification error of 12.0% in the 
training set and 24.0% under LOOCV, while the corresponding figures for 
the LDA were 25.0% and 27.0%, respectively. A common way to assess the 
performance of classification rules is the Receiver Operating Characteristic 
curve (ROC) shown in Figure 2, which plots the true positive rate against 
the false positive rate for all the different possible cutpoints. From the ROC 
curves we also calculated the Area Under ROC curve (AUC) for the three 
models, with higher values corresponding to models with better discrimi- 
nation capabilities. We obtained 0.9792 for the BSP model, 0.9334 for the 
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BP model, and 0.7464 for LDA. These results clearly suggest the superior- 
ity of the proposed BSP model for wine authentication in our simulation, 
compared to the other alternatives. 

Another important aspect of the analysis concerns comparing model ad- 
equacy of the BP versus our BSP proposal. To this effect, we calculated the 
Conditional Predictive Ordinates (CPOi) [Chen, Shao and Ibrahim (2000)], 
summarized in the log-pseudo marginal likelihood statistic LPML = 
Er=ilog(CPOi) [Geisser and Eddy (1979)]. Models with higher LPML val- 
ues are to be preferred. We found the proposed BSP to perform better than 
the BP, as the corresponding LPML values were —365.3 and —428.7, respec- 
tively. As an additional comparison we considered the Deviance Information 
Criterion (DIC) [Spiegelhalter et al. (2002)]. Models with lower DIC values 
are to be preferred. It is important to point out that several possible DIC 
construction are available. Celeux et al. (2006) discuss the DIC for finite 
mixture and random effects models, giving and comparing different DIC 
definitions. Our proposal is a mixture model, but the mixture is infinite. 
Indeed, due to the convolution in (4), the random effects {Oi} have continu- 
ous distributions. Nevertheless, we found some constructions of DIC to yield 
negative values for the effective dimension p^, and so we used DICi, DIC2 
and DIC3 of Celeux et al. (2006). The corresponding values were 721.4, 717.3 
and 723.1 for the BSP model, and 856.5, 854.4 and 855.4 for the BP model, 
and the conclusion is the same regardless of the specific construction. In 
summary, both LPML and DIC criteria consistently point to the superiority 
of the BSP model compared to the BP one. Overall, the results suggest that 
the BSP model is more flexible, especially when the data cluster between 
and within covariate levels. 

6. Performance of the model with wine data set. We consider now appli- 
cation of the proposed BSP model to the wine data set. The response vector 
is formed by the nine anthocyanins listed in Section 2. As covariates, we use 
grape variety (fixed effects) and valleys (random effects). The hyperparam- 
eter values were taken as oq = (0,0,0,0,0,0,0,0,0)*, tq = IOO/9, Qo = O.I/9, 
Lq = 0.01/9, z^o = 11, ro = 65, to = 11, Rq = WIpk, 70 = H, <Po = O.Ol/p and 
ai = 02 = 1, where p = 9, q = 3 and k = 7. The resulting prior densities 
are proper, but the one for B is vague and hence relatively uninformative. 
The prior density for R is relatively uninformative too. All the prior covari- 
ance matrices were assumed of diagonal form. The selected hyperparameter 
values imply proper but vague prior distributions, representing the lack of 
genuine prior information on the parameters. To further investigate how sen- 
sitive the results are to the above choices, we conducted a prior sensitivity 
analysis for hyperparameters (01,02), which control the implied clustering 
structure, and tq, which controls the prior variance for fixed effects. We tried 
the prior settings listed in the leftmost column of Table 2, which also shows 
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Table 2 

Posteriors means and standard deviations (S.D.) for some model quantities. For the 
combinations indicated m the leftmost column, we present summaries for M , r and for 
the fixed effects parameters for Cabernet Sauvignon corresponding to anthocyanins 

PT (Pvi) and MV (^15) 



PT MV 



Prior value 


Precision parameter 


Mean 


(S.D.) 


Mean 


(S.D.) 




= 02 =0.01 


A/ = 


1.41 (0.62) 


/3l3 


= 2.55 


(0.14) 


/?15 


= 4.91 


(0.12) 


ro 


= 100/9 






r 


= 0.15 


(0.0136) 


r 


= 0.08 


(0.0088) 


ai 


= 02 = 1 


M = 


1.65 (0.57) 


I3rs 


= 2.68 


(0.09) 


/?15 


= 5.07 


(0.11) 


ro 


= 100/9 






T 


= 0.16 


(0.0134) 


r 


= 0.08 


(0.0088) 


fli 


= 1, 02=0.1 


M = 


1.48 (0.56) 


PlZ 


= 2.76 


(0.12) 


/3l5 


= 5.03 


(0.11) 


To 


= 100/9 






T 


= 0.16 


(0.0129) 


r 


= 0.08 


(0.0081) 


fli 


= 10, 02 = 1 


M = 


3.27 (0.86) 


/3l3 


= 2.58 


(0.11) 


/3l5 


= 4.95 


(0.09) 


To 


= 100/9 






T 


= 0.16 


(0.0126) 


r 


= 0.08 


(0.0079) 


fli 


= 02 = 1 


M = 


1.43 (0.50) 


PlZ 


= 2.46 


(0.17) 


/3l5 


= 4.93 


(0.15) 


To 


= /9 






T 


= 0.16 


(0.0132) 


r 


= 0.08 


(0.0105) 


fll 


= 02 = 1 


M = 


1.21 (0.48) 


PlZ 


= 2.58 


(0.12) 


/3l5 


= 4.89 


(0.10) 


To 


= 10/9 






T 


= 0.16 


(0.0129) 


r 


= 0.08 


(0.0081) 


fll 


= 02 = 1 


M = 


1.82 (0.61) 


/3l3 


= 2.69 


(0.08) 


/3l5 


= 5.05 


(0.08) 


To 


= 1000/9 






r 


= 0.17 


(0.0134) 


r 


= 0.09 


(0.0093) 


ai 


= 02 = 1 


M = 


1.38 (0.52) 


/3l3 


= 2.57 


(0.14) 


/?15 


= 5.0 


(0.07) 


To 


= 10000/9 






r 


= 0.16 


(0.0125) 


r 


= 0.08 


(0.0082) 



the posterior means and standard deviations for some key parameters in the 
model. Specificany, we show posterior summaries for M, r and for the fixed 
effects parameters for Cabernet Sauvignon corresponding to anthocyanins 
PT and MV. We generally found no notorious changes in these summaries 
across different prior configurations. Since our emphasis is on classification, 
we also compared the shape of the predictive distributions and the classifi- 
cation itself for the prior configurations of Table 2. In general, we did not 
observe big changes in the predictive distributions (data not shown) and the 
classification was the same for all prior specifications. 

Table 3 shows the classification results, where the values within paren- 
thesis were obtained using a LOOCV approach. The classification error ob- 
tained in the training set was 0.5%, and 3.2% under LOOCV. These values 
are better than those obtained by Gutierrez et al. (2011) with the same data 
set but applying a Bayesian parametric model, namely, 3.0% in the training 
set and 3.5% using LOOCV. 

Table 4 shows the AUC values, which were calculated based on separate 
ROC curves for each grape variety, and for each of the BSP and BP mod- 
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Table 3 

Misclassification rate for the three grape varieties 



Variety 


Carmenere 


C. Sauvignon 


Merlot 


Error 


Carmenere 


94 (93) 


1 (1) 


0(1) 


1.1% (2.1%) 


C. Sauvignon 


0(0) 


228 (225) 


0(3) 


0.0% (1.3%) 


Merlot 


1 (8) 


0(0) 


75 (68) 


1.3% (10.5%) 


Total error 








0.5% (3.2%) 





Table 4 






Area under ROC curve 




Grape variety 


AUG BSM 


AUG BPM 


Cavernet Sauvignon 


0.999999 


0.9969221 


Merlot 


0.999999 


0.9967403 


Carmenere 


0.999999 


0.9963574 



els. All these values are very high, with the BSP model attaining the best 
performance across the three grape varieties. Comparing the BSP and BP 
models, we found that the DICi, DIC2, DIC3, and LPML statistics values 
were -5,901.5, -6,368.9, -5,769.9 and 2,430.1 for the former, and -3,659.1, 
—5,006.6, —3,446.2 and 1,351.6 for the latter. Again, these results suggest 
that the proposed BSP model provides a superior fit, and all the criteria val- 
ues are consistent. We also note that in all cases, the effective dimension 
for the Die was positive. 

Figure 3 displays bivariate posterior predictive distributions for Carmene- 
re wines from the valleys of Aconcagua, Maipo, Rapel and Curico, consid- 
ering the PECU and MVCU anthocyanins. The points on the graph are the 
observed values. We can see the changes in the posterior predictive distribu- 
tion across valleys. Predictions for the Aconcagua valley show less variation 
compared to the Maipo valley. Predictions for The Rapel valley show more 
variability, with some evidence of asymmetry, as dictated by the observed 
data, but the model provides a reasonable fit to this behavior. Finally, the 
Curico valley also exhibits asymmetry. 

Figure 4 shows the bivariate predictive posterior distributions for Caber- 
net Sauvignon, Carmenere and Merlot from the Curico valley considering the 
PECU and CY anthocyanins. This plot is interesting because it shows how 
informative are PECU and CY in terms of the target classification. These 
two anthocyanins show that some Merlot and Carmenere samples have very 
similar chemical profiles. This behavior is reasonable because some years 
ago, Carmenere, which in other countries disappeared due to phylloxera, 
was rediscovered in Chile. Formerly, all vineyards planted with this grape 
variety in Chile were declared as Merlot. Using SSR DNA markers to confirm 
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Fig. 3. Bivanate posterior predictive distributions under the BSP model for Carmenere 
wines from the Aconcagua, Maipo, Rapel and Curico valleys, with points representing ob- 
served values. The anthocyanins considered here were PECU and MVCU in the horizontal 
and vertical axes, respectively. 



varietal identity, Hinrichsen et al. (2001) found that from a total of 93 vines 
of five Chilean vineyards, originally planted as Merlot, four vines matched 
Carmenere. This leads to the conclusion that at the time of collecting wine 
samples, those vineyards declared as Carmenere are correctly identified with 
high probability, but a certain percentage of vineyards declared as Merlot 
still correspond to Carmenere. 

7. Concluding remarks. We have proposed a linear mixed effects model 
for wine authentication, featuring a flexible model for random effects that 
does not require to restrict ourselves to a given parametric form. We did so 
by resorting to Dependent Dirichlet Processes, which allow the set of random 
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effects distributions to be similar but not identical to each other, depending 
on values of a set of covariates. For the authentication problem, dependence 
on covariate levels is important because it is reasonable to think that foods 
or beverages that come either from the same region of origin, or those which 
were made with the same technology, could be similar or correlated. The 
ANOVA-DDP approach was suitable to our purposes, but other types of 
nonparametric priors could be considered. 

The proposed BSP model provided a better fit to the data than a para- 
metric alternative, as we showed in the simulation example and in the appli- 
cation to the wine data. In terms of the target classification, the BSP model 
also provided slightly better results than other alternatives. Our proposal 
was motivated by food authentication, but it could be used in any situation 
where the aim is to classify subjects or units into several groups, on the basis 
of multiple responses and covariates. 

APPENDIX 

In this section we give the MCMC algorithm that was used for posterior 
simulation under the proposed model. Because the model is of conjugate 
type, we use algorithm 2 in Neal (2000). Let c = (ci, . . . ,Cn) denote a vec- 
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tor that captures the clustering of Oi and let a = {oc ■ c E {ci, . . . , c„}). To 
resample the configurations Cj, we proceed with the following two steps: 

Step 1. If c = Cj for some j y^i, we compute the probability that the ith 
element in c equals other element in the same set as 

P{ci = c I c-i,6i,a) 

(13) =5^hv_(2vr)-P/Vr'/' 

X exp|-^(^j - ZiacYr'^iOi - ZiUc 

Here n^^c is the number of Cj that are qual to c, c_j are all the Cj for j 
and b is such that if c = Cj, then . j^i{P{ci = c)} + P(cj ^ Cj Vj 7^ i) = 1. 
Next, we compute the probability that Cj is different to any other element 
in c as 

P{ci / Cj for all j c-i, 9i,a) 

(14) =b — - — (27r)-p/vr'/'i^r'/'iAi'/' 



X exp|-^[^*T-i^, - [eiT~'z^mziT~'6.^]y 



If the imputed value of q, sampled based on (13) and (14), is not asso- 
ciated with any other observation, it is necessary to draw a value of 
from Hi, the posterior distribution for a based on the prior Gq and the 
single observation 9i. In our case Hi is given by Hi = Npk{ai, Di), where 
Di = [zjr-^z, + R-^]~\ and = 

Step 2. In the second step, for all c G {ci, . . . , c„} we draw a new value Oc 
given all the 6i for which Cj = c, that is, from the posterior distribution based 
on the prior Gq and all the data points currently associated with latent 
class c. In our case, this is given by Npk{ac, E), where E = [X^j. c =c^i''"~^^i"'" 
R-']-' ^^^da, = E[Y:,..,^=,ziT-%]. 

Now we list all the full conditional distributions for the parametric part of 
the model. The specific derivation details are straightforward and therefore 
omitted. 

• For fixed effect parameters we have 

f3j I other parameters and data~ Np(f3j,Vj), 

where 

^«^y]{a^ij2/i-3^ii3;i/^Ai XijXii^Pi^-Xijdi] >+A~Voj 
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and 



where 



' 9 ( riu 1 ~i 

u=i Ki=i ) 



{ll,l2,---,lq)^j, j = l,...,q. 

For the random effects parameters 9iu, ■ ■ ■ , Onu, u = ^, ■ ■ ■ , 9, we have that 

0iu I other parameters and data ~ Np{9iu,Qu), i = I, . . . ,n, 
where 

Qu = [t~^ + '^u^]~^ and 9iu = Qu[T~^Ziai + J:~'^yi-T,~^Bxi]. 
For hyperparameters /3oi , . . . , /3oq we have 

(3oj I other parameters and data ~ Np{/3Qj, Dq), 

where 

Boj=Do[X-^/3j+T^^l3o], j = l,...,q, and Do = [A-^ + t^^]-\ 
For hyperparameter A we have 

A I other parameters and data ~ IWp{d, E), 

where 

E = ^Wj - Poj)Wj - PojY + Lo and d = q + to. 
i=i 

Finahy, for the covariance matrices Si, . . . , E^, r and -R we have 
Tiu I other parameters and data ~ IWp{lu,Hu), 

where 

Hu = '^iyi- Bxi-9i){yi- Bxi-OiY + Qo and lu = nu + i^o, 

i=l 

T I other parameters and data ~ IWp{s,T), 



where 



T = '^{9i- Ziai){9i- ZiOi)'^ + ^0 and s = n + 7o, 



where 



i? I other parameters and data ~ IWpk{f, O), 

n 

O = ^^aja- + i?o and / = n + ro. 
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